Abstract

In this report, we propose a 3D reconstruction method for the full-view
fisheye camera. The camera we used is the Ricoh Theta (Fig. 1), which
captures spherical images and has a wide field of view (FOV). The
conventional stereo approach based on the perspective camera model cannot
be directly applied; instead, we used a spherical camera model to describe
the relation between a 3D point and its corresponding observation in the
image. We implemented a system that can reconstruct the 3D scene using
captures from two or more cameras. A GUI was also created to let users
control the viewing perspective and gain a better intuition of how the
scene is rebuilt. Experiments showed that our reconstruction results
preserve the structure of the real-world scene well.


1. Introduction 

Wide field-of-view (fisheye) cameras have received increasing attention
over the past few years, with broad applications in surveillance,
robotics, intelligent vehicles, immersive virtual environment
construction, etc. For example, Nissan Motors developed a visual system
that consists of four fisheye cameras mounted on the four sides of the
vehicle. Together they cover the entire 360° surrounding scene and allow
drivers to examine all the visual blind spots that may cause danger. In
surveillance, IP fisheye cameras have become extremely prevalent for their
wide coverage and easy accessibility. Samsung provides a product with over
5 megapixels and a 360° FOV, which is built into an alarm system
performing intelligent motion detection, audio detection, and tampering
detection. The supporting de-warping software allows users to undistort
any subregion in the captured image.

Recently, Ricoh unveiled its first personal 360° fisheye camera, the
Ricoh Theta. Two fisheye cameras are embedded on the front and back sides
to capture the entire scene with one click. The two captured images are
then stitched together to provide a dynamic 360° view with a perspective
the user can adjust. With this portable and handy device, our project aims
to reconstruct the 3D scene using the captured spherical images. One
important advantage of using this camera is that we no longer need to set
up multiple traditional cameras at different locations and directions to
cover the entire scene. As a tradeoff, the traditional camera model with
perspective projection cannot be directly applied, since a fisheye camera
has large radial distortion, especially near the border. To establish a
one-to-one mapping between the 180° scene and a circular image, we created
a model based on spherical projection. Based on this model, we can develop
the epipolar geometry for fisheye cameras and solve the triangulation
problem with the least-squares method. We used manually selected points to
calculate the fundamental matrix, then applied it as a filter to prune the
SIFT matching result, and finally augmented the point correspondences for
reconstruction. We also tried dense reconstruction by first rectifying the
images and then calculating the disparity map.

Figure 1. Ricoh Theta camera

Figure 2. Spherical image captured by Ricoh Theta

In Sec. 2 we briefly review previous work on 3D reconstruction with
fisheye cameras. In Sec. 3 we go into the details of our camera model, the
revised epipolar geometry and data augmentation, and also describe the
extension to multicamera registration and dense reconstruction. In Sec. 4
we first show the reconstruction result using hand-picked points, then the
SIFT-augmented result. Next, we give the disparity map and dense
reconstruction result. Finally, we show a snapshot of our GUI and provide
the source code package for users to try.

2. Related Work 

2.1. Previous work 

The perspective camera model is the most popular camera model in 3D
reconstruction. However, it is limited by its narrow field of view.
Fisheye cameras, which can capture spherical images, have received more
attention in recent years. Their major advantage is the wide FOV and thus
the additional information they can capture from the environment.

Shah and Aggarwal presented an autonomous mobile robot navigation system
for indoor environments using two calibrated fisheye sensors. Micusik et
al. proposed a 3D reconstruction of the surrounding scene from two or more
uncalibrated fisheye images. Li performed 3D reconstruction by computing
spherical disparity maps with a binocular fisheye camera: the binocular
camera was first calibrated to rectify the captured images, and
correlation-based stereo was then used to acquire a dense 3D
representation of some simple environments. Herrera et al. and Moreau et
al. pointed the camera upwards and retrieved the environment information
from the images. They computed disparity maps without an image
rectification step.

2.2. Project contribution 

• Proposed the camera model and epipolar geometry for the fisheye camera.

• Designed a method to estimate camera rotation and position from point
correspondences in multiple images.

• Implemented SIFT feature extraction and matching through
equirectangular-to-cube mapping.

• Proposed sparse and dense 3D reconstruction algorithms from multiple
images.

• Developed a graphical user interface to interactively show multiple
correlated 360° images.

Figure 3. Fisheye camera model

Figure 4. Epipolar geometry

3. Method 

In this section, we go over the mathematical model behind this project. It
consists of four main parts: the fisheye camera model, epipolar geometry,
multicamera registration, and image rectification for dense
reconstruction.

3.1. Fisheye camera model 

The fisheye camera model is based on spherical projection. Suppose there
is a sphere of radius $f_s$ and a point $P$ in space, as shown in Fig. 3.
First, $P$ is projected to $p^*$, which is the intersection of the sphere
surface with the line defined by the sphere center $O$ and the point $P$.
This defines a mapping from spatial points to points on the sphere
surface. Then these points are vertically projected onto the image plane,
so that $p^*$ is projected to $p$, which results in a circular image. In
mathematical terms, let $P = [X_P, Y_P, Z_P]^T$; then we have
$p^* = [f_s \sin\phi \cos\theta,\ f_s \sin\phi \sin\theta,\ f_s \cos\phi]^T$.
The relation between $P$ and $p^*$ is

$$p^* = \lambda P$$

where $\lambda = f_s/\rho$ and $\rho = \sqrt{X_P^2 + Y_P^2 + Z_P^2}$. The
vertical projection reduces the $Z$ component to 0, and we get
$p = [f_s \sin\phi \cos\theta,\ f_s \sin\phi \sin\theta,\ 0]^T$. Here we
let $f_s = 1$, which means we project onto a unit sphere.

The raw images acquired by the Ricoh Theta, shown in Fig. 2, are in
equirectangular form with resolution 1024×2048, i.e. the $(x, y)$ image
coordinates represent the longitude and the latitude on the unit sphere:

$$\phi = \frac{y}{1024}\,\pi$$

$$\theta = \frac{x}{1024}\,\pi$$

$$p^* = [\sin\phi \cos\theta,\ \sin\phi \sin\theta,\ \cos\phi]^T$$
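The pixel-to-sphere mapping above can be illustrated with a short sketch. This is not the project's own code: the function names `pixel_to_sphere` and `sphere_to_pixel` are hypothetical, and the inverse mapping is an added convenience that follows directly from the same two angle formulas.

```python
import math

HEIGHT = 1024  # image height quoted in the text; the width is 2 * HEIGHT


def pixel_to_sphere(x, y, height=HEIGHT):
    """Map an equirectangular pixel (x, y) to a point on the unit sphere."""
    phi = y / height * math.pi    # polar angle measured from the +Z axis
    theta = x / height * math.pi  # azimuth; spans 2*pi across the full width
    return (math.sin(phi) * math.cos(theta),
            math.sin(phi) * math.sin(theta),
            math.cos(phi))


def sphere_to_pixel(p, height=HEIGHT):
    """Inverse mapping: a unit vector back to equirectangular pixel coordinates."""
    px, py, pz = p
    phi = math.acos(max(-1.0, min(1.0, pz)))       # clamp against rounding error
    theta = math.atan2(py, px) % (2.0 * math.pi)   # wrap azimuth into [0, 2*pi)
    return (theta / math.pi * height, phi / math.pi * height)
```

For example, `pixel_to_sphere(0, 512)` lies on the equator of the sphere along the +X axis, and feeding any sphere point back through `sphere_to_pixel` recovers the original pixel up to rounding.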

3.2. Epipolar geometry 

Now assume there are two cameras centered at $O$ and $T$, as shown in
Fig. 4, and a point $P$ in 3D space. For camera 1, the projection on the
spherical surface is $p_1^* = P/\|P\|$; for camera 2, the projection on
the spherical surface (in world coordinates) is
$p_2^* = T + (P - T)/\|P - T\|$. Without loss of generality, assume the
reference system of camera 1 is the same as the world reference system,
and the rotation and translation between camera 1 and camera 2 are $R$ and
$T$. Then we have $z_{p,1} = I\,p_1^*$ and $z_{p,2} = R(p_2^* - T)$, where
$z_{p,1}$ and $z_{p,2}$ are the coordinates of $p_1^*$ and $p_2^*$ in
their cameras' reference systems.

Notice that we now have five coplanar points: the camera centers $O_1$,
$O_2$ and the points $p_1^*$, $p_2^*$, $P$. Thus we have the constraint
$\overrightarrow{O_1 p_1^*} \cdot (\overrightarrow{O_1 O_2} \times \overrightarrow{O_2 p_2^*}) = 0$,
i.e. $(p_1^*)^T (T \times (p_2^* - T)) = 0$.

Substituting $z_{p,1}$ and $z_{p,2}$ for $p_1^*$ and $p_2^*$, we get

$$z_{p,1}^T \, [T]_\times \, R^{-1} z_{p,2} = 0$$

Defining $F = [T]_\times R^{-1}$ as the fundamental matrix for a fisheye
camera pair, we have the constraint $z_{p,1}^T F z_{p,2} = 0$. Now we can
use the eight-point algorithm or RANSAC to solve for $F$. Once we get the
fundamental matrix, we can calculate the epipoles in the two cameras by
solving $F^T e_1 = 0$, $F e_2 = 0$.

Recall that the epipoles are defined as $e_1 = T/\|T\|$,
$e_2 = -RT/\|T\|$, which gives $e_2 = -R e_1$. Then the rotation matrix
$R$ can be derived as

$$R = I + [v]_\times + [v]_\times^2 \, \frac{1 - c}{s^2}$$

where $v = (-e_1) \times e_2$, $s = \|v\|$, $c = -e_1^T e_2$. Here we
assume the Euclidean distance between $O_1$ and $O_2$ is 1, i.e.
$T = e_1$.
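The rotation recovery from the two epipoles can be sketched as follows. This is an illustration only, assuming the rotation formula has the standard form $R = I + [v]_\times + [v]_\times^2 (1 - c)/s^2$ with $v = (-e_1) \times e_2$, which gives the smallest rotation taking $-e_1$ to $e_2$. The helper names are hypothetical and vectors are plain Python lists.

```python
import math


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def skew(v):
    """Cross-product matrix [v]_x, so that skew(v) applied to w equals v x w."""
    return [[0.0, -v[2], v[1]],
            [v[2], 0.0, -v[0]],
            [-v[1], v[0], 0.0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(A, x):
    return [dot(row, x) for row in A]


def rotation_from_epipoles(e1, e2):
    """Rotation R with R(-e1) = e2, built from v = (-e1) x e2, s = |v|, c = -e1.e2."""
    a = [-x for x in e1]
    v = cross(a, e2)
    s = math.sqrt(dot(v, v))
    c = dot(a, e2)
    if s < 1e-12:
        if c > 0:  # -e1 and e2 already coincide: no rotation needed
            return [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
        raise ValueError("-e1 and e2 are opposite; the rotation axis is not unique")
    K = skew(v)
    K2 = matmul(K, K)
    k = (1.0 - c) / (s * s)
    return [[(1.0 if i == j else 0.0) + K[i][j] + k * K2[i][j] for j in range(3)]
            for i in range(3)]
```

A quick sanity check is that `matvec(R, e1)` negated reproduces `e2`, which is the relation $e_2 = -R e_1$ used in the derivation.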

Now we can triangulate $P$ using the parameters $z_{p,1}$, $z_{p,2}$,
$e_1$, and $R$. We define the line passing through $O_1$ and $z_{p,1}$ as
$a z_{p,1}$, where $a \in \mathbb{R}$, and the line passing through $O_2$
and $z_{p,2}$ as $b R^{-1} z_{p,2} + e_1$, where $b \in \mathbb{R}$. The
goal of triangulation is to find the points on the two lines that are
closest to each other. We can formulate this as a least-squares problem,

$$\underset{a,b}{\text{minimize}} \ \| a z_{p,1} - b R^{-1} z_{p,2} - e_1 \|$$

where the optimal solution is given by

$$\begin{bmatrix} a^* \\ b^* \end{bmatrix} = (A^T A)^{-1} A^T e_1, \quad A = \begin{bmatrix} z_{p,1} & -R^{-1} z_{p,2} \end{bmatrix}$$

Once we have the optimal parameters $a^*$, $b^*$, the minimal distance is
achieved between $a^* z_{p,1}$ and $b^* R^{-1} z_{p,2} + e_1$, and $P$ can
be assigned as their midpoint:

$$P = (a^* z_{p,1} + b^* R^{-1} z_{p,2} + e_1)/2$$
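A minimal sketch of this midpoint triangulation is given below. It assumes the least-squares solution takes the normal-equation form $(A^T A)^{-1} A^T e_1$ with $A = [z_{p,1},\ -R^{-1} z_{p,2}]$, and solves the resulting 2×2 system in closed form. The function names are hypothetical, and $R^{-1}$ is applied as $R^T$, which is valid for a rotation matrix.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def apply_inverse_rotation(R, v):
    """Return R^T v, which equals R^{-1} v when R is a rotation matrix."""
    return [sum(R[k][i] * v[k] for k in range(3)) for i in range(3)]


def triangulate_midpoint(z1, z2, R, e1):
    """Midpoint of the closest points on the rays a*z1 and b*R^{-1}z2 + e1."""
    u = list(z1)
    w = apply_inverse_rotation(R, z2)
    # Entries of A^T A and A^T e1 for A = [u, -w].
    uu, ww, uw = dot(u, u), dot(w, w), dot(u, w)
    ue, we = dot(u, e1), dot(w, e1)
    det = uu * ww - uw * uw
    if abs(det) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    a = (ue * ww - uw * we) / det
    b = (uw * ue - uu * we) / det
    on_ray_1 = [a * ui for ui in u]
    on_ray_2 = [b * wi + ei for wi, ei in zip(w, e1)]
    return [(p + q) / 2.0 for p, q in zip(on_ray_1, on_ray_2)]
```

With noise-free observations the two rays intersect, so the midpoint coincides with the true point; with noisy observations it is the point halfway along the shortest segment joining the rays.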

3.3. F estimation & data augmentation

In order to calculate $F$, we must have enough point correspondences in
multiple images. In our method, we manually selected around 45 pairs of
corresponding points. We also attempted to estimate $F$ automatically by
applying RANSAC with the constraint $z_{p,1}^T F z_{p,2} = 0$ to the SIFT
matching results. However, SIFT matching is not invariant to radial
distortion, and the matching results contain too many outliers, so the
resulting estimate of $F$ is not robust enough. Instead, we estimated $F$
from the hand-picked points and in turn used $F$ to filter the SIFT
matches and extend our pool of point pairs.
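The filtering step can be sketched as a threshold on the epipolar residual $|z_{p,1}^T F z_{p,2}|$. The threshold value `tol` and the function names are assumptions made for illustration; the report does not state which tolerance was used.

```python
def epipolar_residual(F, z1, z2):
    """Absolute value of z1^T F z2 for unit-sphere observations z1 and z2."""
    Fz2 = [sum(F[i][j] * z2[j] for j in range(3)) for i in range(3)]
    return abs(sum(z1[i] * Fz2[i] for i in range(3)))


def filter_matches(F, matches, tol=0.02):
    """Keep only the (z1, z2) pairs that satisfy the epipolar constraint within tol.

    tol is a hypothetical threshold and would need tuning on real data.
    """
    return [(z1, z2) for z1, z2 in matches if epipolar_residual(F, z1, z2) < tol]
```

Because both observations are unit vectors and the baseline is normalized to 1, the residual is bounded by 1, which makes a fixed threshold easier to interpret than it would be in pixel coordinates.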

3.4. Multicamera registration 

Next, we extend the discussion to the multi-view scenario. From the
section above we can obtain the fundamental matrix and epipoles for each
pair of cameras, but we can no longer assume the Euclidean distance
between camera centers is 1. Now we want to estimate the rotation matrix
and camera position for each camera. This can be done in a two-step
process.

First, we estimate the rotation of each camera. Assume we have $n$
cameras. For each pair of cameras $i$ and $j$, we can calculate the
epipoles $e_{ij}$ and $e_{ji}$, which denote $O_j$ on image $i$ and $O_i$
on image $j$, respectively. Here, we assume the cameras all lie on the
same horizontal plane, which is a very good approximation of how we took
the pictures. Therefore, the rotation of each camera can be represented by
an angle $\theta_i$. The relation between $\theta_i$ and the rotation
matrix $R_i$ is

$$R_i = \begin{bmatrix} \cos\theta_i & -\sin\theta_i & 0 \\ \sin\theta_i & \cos\theta_i & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

The epipole directions in world coordinates are

$$e_{ij,w} = R_i^{-1} e_{ij}$$

$$e_{ji,w} = R_j^{-1} e_{ji}$$

$$e_{ij,w} = \pm e_{ji,w}$$

The last line holds because both vectors denote the direction of the line
segment defined by $O_i$ and $O_j$. The ± sign indicates the two-fold
ambiguity in calculating $e_{ij}$ from the fundamental matrix.

Now we need to minimize the objective function

$$E = \sum_{i,j} \left( 1 - \left| e_{ij,w} \cdot e_{ji,w} \right| \right)$$

which is a convex optimization problem, and we solve it by Newton's
method.

Next, we estimate the position of each camera. As we have the direction of
each line segment $O_iO_j$, this is a triangulation problem. A naive way
to solve it is to choose two cameras, e.g. $O_1$ and $O_2$, and set the
Euclidean distance between them to 1. Then, for each camera $O_i$ other
than $O_1$ and $O_2$, its position can be triangulated from the directions
of $O_1O_i$ and $O_2O_i$. We can repeat the procedure with a different
choice of baseline to check the consistency. We can also feed the result
into another gradient descent program to adjust the camera positions using
all directions $O_iO_j$ obtained.

Now that we have recovered the rotation and translation of each camera,
the object points can be triangulated in a similar way as described in
Sec. 3.2. In the multiview case, we assign $r_i$ as the distance between
the object point and camera center $O_i$, and minimize the mean squared
distance among the $n$ points obtained from each camera image:

$$P_i = r_i R_i^{-1} z_{p,i} + P_{c,i}, \quad i = 1, 2, \ldots, n$$

$$P_{mean} = \frac{1}{n} \sum_i P_i$$

$$\underset{r_1, \ldots, r_n}{\text{minimize}} \ \sum_i \| P_i - P_{mean} \|^2$$

3.5. Image rectification & dense reconstruction

Now we want to go one step further, from sparse reconstruction to dense
reconstruction. To achieve a dense reconstruction, we need to rectify the
image pairs so that their epipolar lines are horizontal and all
corresponding points have the same vertical coordinate on the image. As we
know, the epipolar lines in spherical images are circles that pass through
the epipoles. Therefore, if we rotate the camera reference frames such
that the Z axes align with the epipoles and the X axes are parallel, and
map the sphere onto an equirectangular image, then the epipolar lines
become vertical lines in the equirectangular images, as shown in Fig. 5.
By exchanging the horizontal and vertical coordinates, the image pairs are
rectified. From the rectified image pairs we can calculate a disparity map
$D$, which is the distance (in pixels) between corresponding points on the
image pair. As the images are equirectangular, $D$ equals the angle
$\angle O_1 P O_2$ up to a constant factor. Assume the corresponding
points are $(x_1, y)$ and $(x_2, y)$ respectively, and $d = x_2 - x_1$;
then $\|O_1P\| = T \sin(x_2)/\sin(d)$ and
$\|O_2P\| = T \sin(x_1)/\sin(d)$, where $T = \|O_1O_2\|$. From that we can
calculate the 3D coordinates of point $P$.

Figure 5. Image rectification pipeline

4. Experiment

4.1. Two cameras reconstruction

In this section, we show the reconstruction results for a two-camera
setting. The two raw images are shown in Fig. 6.

Figure 6. 2-camera raw image

4.1.1 Ground truth point matching

To apply the eight-point algorithm to compute the fundamental matrix, we
manually labeled ground truth point correspondences on circular images,
shown in Fig. 7. For each view, we labeled around 45 pairs of
corresponding points, which are typically on the ceiling or on the walls
and thus easy to recognize. There are also several points around the desk,
such as the corner of the computer.



Figure 7. Ground truth point correspondences 


4.1.2 SIFT point matching

We also tried to extract point correspondences using SIFT and then
estimate $F$ automatically using RANSAC. However, due to the large number
of outliers, this approach is not robust enough. Therefore, we propose
another pipeline: use the ground truth $F$ to filter the SIFT matching
results, and add the surviving correspondences to our correspondence pool
to achieve a denser reconstruction result. In our project, we used the
SIFT implementation in the VLFeat toolbox for extracting features and
performing point matching. We applied point matching on both raw images
and cubic images obtained by cube mapping. Fig. 8 shows a rough matching
result using cubic images.


4.1.3 3D reconstruction 


Using the ground truth fundamental matrix $F$, we calculated $e_1$, $e_2$
and $R$. Then we used the triangulation method proposed in Sec. 3.2 to
recover the points' positions in 3D space. The reconstruction result is
shown in Fig. 9.


4.2. Multiple cameras reconstruction 

Now we show the reconstruction result using pictures captured at 6
different locations, which is equivalent to 6 cameras. We manually
selected 12 corresponding points on each of the 6 images. For each pair of
images we calculated the fundamental matrix $F$ and the epipoles $e_1$,
$e_2$. Then we calculated the rotation and position of each camera. The
positions of the cameras are shown in Fig. 10. Once the rotation and
position of each camera are obtained, we can triangulate the corresponding
points as well as rectify each pair of images. The 3D reconstruction for
the 12 points is shown in Fig. 11. The result matches well with the ground
truth.
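The rotation step of Sec. 3.4 scores a set of candidate camera angles by how well each pair of epipoles lines up once rotated into world coordinates. The sketch below evaluates such a score, assuming the objective has the form $E = \sum (1 - |e_{ij,w} \cdot e_{ji,w}|)$ with each unordered pair counted once and planar rotations about the Z axis. The names `rotate_z` and `rotation_objective` and the dictionary layout are hypothetical; the minimization itself (Newton's method in the report) is not shown.

```python
import math


def rotate_z(v, angle):
    """Rotate a 3-vector about the Z axis by the given angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1], v[2]]


def rotation_objective(thetas, epipoles):
    """Sum of 1 - |e_ij,w . e_ji,w| over camera pairs i < j.

    thetas[i] is the planar rotation angle of camera i.
    epipoles[(i, j)] is the epipole e_ij, expressed in the frame of camera i.
    """
    total = 0.0
    for (i, j), e_ij in epipoles.items():
        if i < j and (j, i) in epipoles:
            e_ij_w = rotate_z(e_ij, -thetas[i])              # R_i^{-1} e_ij
            e_ji_w = rotate_z(epipoles[(j, i)], -thetas[j])  # R_j^{-1} e_ji
            total += 1.0 - abs(sum(a * b for a, b in zip(e_ij_w, e_ji_w)))
    return total
```

The absolute value makes the score insensitive to the sign ambiguity of each epipole, so it reaches zero whenever the two world-frame directions are parallel or anti-parallel.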



Figure 8. SIFT matching result on cubic images 


4.3. Disparity map & dense reconstruction 

After rectification, corresponding points lie at the same longitude. So
after transforming the raw image into a longitude-latitude image, we can
use the traditional method to find the corresponding pairs in the images.
The calculated disparity map is shown in Fig. 12 together with the two
rectified images. Brighter regions indicate smaller disparity and darker
regions indicate larger disparity. As we can see, the map roughly captures
the depth information. However, since the rectified images still have
distortion, the disparity map may be noisy. The reconstruction result is
shown in Fig. 13. Although the result looks a little messy, we can see
that the closet is reconstructed fairly well.
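Turning a disparity value into a distance uses the law-of-sines relation from Sec. 3.5. The sketch below is illustrative only: it assumes the rectified coordinates have already been converted to angles measured from the baseline direction, and the function names and the near-zero guard are hypothetical additions.

```python
import math


def pixel_to_angle(x_pixels, height=1024):
    """Convert a rectified pixel coordinate to an angle, using pi radians per 1024 pixels."""
    return x_pixels / height * math.pi


def ranges_from_angles(x1, x2, baseline=1.0):
    """Distances |O1 P| and |O2 P| from the angles x1, x2 of P seen by each camera.

    Uses |O1 P| = T sin(x2) / sin(d) and |O2 P| = T sin(x1) / sin(d), d = x2 - x1.
    Returns None when the angular disparity is too small to give a stable range.
    """
    sin_d = math.sin(x2 - x1)
    if abs(sin_d) < 1e-9:
        return None
    return (baseline * math.sin(x2) / sin_d,
            baseline * math.sin(x1) / sin_d)
```

Points with near-zero disparity are either very far away or close to the baseline direction, which is one reason a disparity map built this way tends to be noisy near the epipoles.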

4.4. GUI implementation 

A graphical user interface (GUI) was developed using the 6-view dataset.
You can run demos.m to see the demonstration. Fig. 14 gives a brief
illustration of the 6 views obtained by user control.

5. Conclusion 

In this project we implemented a 3D reconstruction algorithm for multiple
spherical images. We obtained our data using the Ricoh Theta full-view
fisheye camera. We used both manually selected points and SIFT matching
points to estimate the fundamental matrix for each pair of images. Then we
calculated the epipoles and the rotation and position of each camera.
Based on this information we implemented sparse 3D reconstruction, and the
result matches well with the ground truth. We also developed a user
interface that lets users interactively view multiple correlated 360°
images. Our project is an important step towards building virtual tours
from large numbers of full-view images.

Figure 9. 2-camera reconstruction result: the first image (manually
labeled points) is the reconstruction result using only hand-picked
points, the second (manually labeled points plus SIFT matching) is the
reconstruction result augmented by SIFT.

Figure 10. Camera positions in multiple reconstruction



Figure 11. Sample points illustration & results 



Figure 12. Rectified images & disparity map 


6. Future Work 

There are two things we want to improve in the future. The first is the
algorithm for generating the disparity map. The second is the robustness
of SIFT matching across datasets. Currently the performance of SIFT
matching fluctuates between image sets. In outdoor images, SIFT matching
performance tends to deteriorate; the reason could be that the camera
centers are too far apart, so the image pairs differ too much, or that
buildings tend to have repetitive features such as arches and windows. We
could improve how the images are captured and select more appropriate
scenes to get better performance.



Figure 13. Dense reconstruction result 



Figure 14. GUI demo